{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LLM Finetuning using AutoTrain Advanced\n",
    "\n",
    "In this notebook, we will finetune a llama-3.2-1b-instruct model using AutoTrain Advanced.\n",
    "You can replace the model with any Hugging Face transformers compatible model and dataset with any other dataset in proper formatting.\n",
    "For dataset formatting, please take a look at [docs](https://huggingface.co/docs/autotrain/index)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from autotrain.params import LLMTrainingParams\n",
    "from autotrain.project import AutoTrainProject"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "HF_USERNAME = \"your_huggingface_username\"\n",
    "HF_TOKEN = \"your_huggingface_write_token\" # get it from https://huggingface.co/settings/token\n",
    "# It is recommended to use secrets or environment variables to store your HF_TOKEN\n",
    "# your token is required if push_to_hub is set to True or if you are accessing a gated model/dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = LLMTrainingParams(\n",
    "    model=\"meta-llama/Llama-3.2-1B-Instruct\",\n",
    "    data_path=\"HuggingFaceH4/no_robots\", # path to the dataset on huggingface hub\n",
    "    chat_template=\"tokenizer\", # using the chat template defined in the model's tokenizer\n",
    "    text_column=\"messages\", # the column in the dataset that contains the text\n",
    "    train_split=\"train\",\n",
    "    trainer=\"sft\", # using the SFT trainer, choose from sft, default, orpo, dpo and reward\n",
    "    epochs=3,\n",
    "    batch_size=1,\n",
    "    lr=1e-5,\n",
    "    peft=True, # training LoRA using PEFT\n",
    "    quantization=\"int4\", # using int4 quantization\n",
    "    target_modules=\"all-linear\",\n",
    "    padding=\"right\",\n",
    "    optimizer=\"paged_adamw_8bit\",\n",
    "    scheduler=\"cosine\",\n",
    "    gradient_accumulation=8,\n",
    "    mixed_precision=\"bf16\",\n",
    "    merge_adapter=True,\n",
    "    project_name=\"autotrain-llama32-1b-finetune\",\n",
    "    log=\"tensorboard\",\n",
    "    push_to_hub=True,\n",
    "    username=HF_USERNAME,\n",
    "    token=HF_TOKEN,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If your dataset is in CSV / JSONL format (JSONL is most preferred) and is stored locally, make the following changes to `params`:\n",
    "\n",
    "```python\n",
    "params = LLMTrainingParams(\n",
    "    data_path=\"data/\", # this is the path to folder where train.jsonl/train.csv is located\n",
    "    text_column=\"text\", # this is the column name in the CSV/JSONL file which contains the text\n",
    "    train_split = \"train\" # this is the filename without extension\n",
    "    .\n",
    "    .\n",
    "    .\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this will train the model locally\n",
    "project = AutoTrainProject(params=params, backend=\"local\", process=True)\n",
    "project.create()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "autotrain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
